A Probabilistic Deduplication, Record Linkage and Geocoding System
Abstract
In many data mining projects in the health sector, information from multiple data sources needs to be cleaned, deduplicated and linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient. Most of the time the linkage process is challenged by the lack of a common unique entity identifier. Additionally, personal information, like names and addresses, is frequently recorded with typographical errors, can be formatted differently, and parts can even be missing or swapped, making the deduplication or linkage task non-trivial. A special case of linkage is geocoding, the process of matching user records with geocoded reference data, allowing spatial data analysis and mining, for example of disease outbreaks, or correlations with environmental factors. In this paper we present an overview of the Febrl (Freely extensible biomedical record linkage) project, which aims at developing improved algorithms and techniques for large scale data cleaning and standardisation, record linkage, deduplication and geocoding. We discuss new probabilistic techniques for data cleaning and standardisation, approximate geocode matching, parallelisation of blocking and linkage algorithms, as well as a probabilistic data set generator.

Record Linkage and Geocoding in Health

The health sector produces and collects massive amounts of data on a daily basis, including administrative Medicare and PBS data, emergency and hospital admission data, clinical data, as well as data collected in special databases like cancer registries. The mining of such data has attracted interest from both academia and governmental organisations. Often data from various sources needs to be integrated and linked in order to allow more detailed analysis. In health surveillance systems linked data can also help to enrich data that is used for pattern detection in data mining systems.
Linked data also allows existing data sources to be re-used for new studies, and reduces the cost and effort of data acquisition for research studies. Linked data might contain information which is needed to improve health policies, and which traditionally has been collected with time consuming and expensive survey methods.

Of increasing interest in the health sector is geocoding, the linking of a data source with geocoded reference data (which is made of cleaned and standardised records containing address information plus their geographical location). The US Federal Geographic Data Committee estimates that geographic location is a key feature in 80% to 90% of governmental data collections [29]. In many cases, addresses are the key to spatially enable data. The aim of geocoding is to generate a geographical location (longitude and latitude) from street address information in the user data. Once geocoded, the data can be used for further processing and in spatial data mining projects, and it can be visualised and combined with other data using geographical information systems (GIS).

The applications of spatial data analysis and mining in the health sector are widespread. For example, geocoded data can be used to find local clusters of disease. Environmental health studies often rely on GIS and geocoding software to map areas of potential exposure and to locate where people live in relation to these areas. Geocoded data can also help in the planning of new health resources, e.g. additional health care providers can be allocated close to where there is an increased need for services. An overview of geographical health issues is given in [4]. When combined with a street navigation system, accurate geocoded data can help emergency services find the location of a reported emergency.

In this paper we present an overview of the Febrl (Freely extensible biomedical record linkage) project, and we discuss our future research plans.
Febrl is implemented in the object-oriented, open source language Python and is available from the project web page. Because its source code is available, Febrl is an ideal platform for the rapid development, implementation, and testing of new and improved record linkage algorithms and techniques.

A Short Overview of Record Linkage

If unique entity identifiers or keys are available in all the data sets to be linked, then the problem of linking or deduplication at the entity level becomes trivial: a simple join operation in SQL or its equivalent is all that is required. However, in most cases no unique identifiers are shared by all of the data sets, and more sophisticated linkage techniques need to be applied. These techniques can be broadly classified into deterministic or rules-based approaches (in which sets of often very complex rules are used to classify pairs of records as links, i.e. relating to the same entity, or as non-links), and probabilistic approaches (in which statistical models are used to classify record pairs). Probabilistic methods can be further divided into those based on classical probabilistic record linkage theory as developed by Fellegi & Sunter [11], and newer approaches using machine learning techniques [6, 9, 10, 13, 15, 19, 21, 28, 30].

Computer-assisted record linkage goes back as far as the 1950s, when most linkage projects were based on ad hoc heuristic methods. The basic ideas of probabilistic record linkage were introduced by Newcombe & Kennedy [22] in 1962, while the theoretical foundation was provided by Fellegi & Sunter [11] in 1969. The basic idea is to link records by comparing common attributes, which include person identifiers (like names and dates of birth) and demographic information. Pairs of records are classified as links if their common attributes predominantly agree, or as non-links if they predominantly disagree.
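The basic idea described above, classifying a record pair by whether its common attributes predominantly agree, can be sketched in Python as follows. This is only a minimal illustration under stated assumptions and not Febrl's implementation: the field names, the helper functions `normalise`, `compare_fields` and `classify_pair`, and the 0.5 threshold are all hypothetical, and exact string equality after simple normalisation stands in for the approximate comparisons a real linkage system would use.

```python
# Hypothetical field names chosen for this illustration only.
FIELDS = ["given_name", "surname", "date_of_birth", "postcode"]


def normalise(value):
    """Strip whitespace and lower-case a value so that trivial
    formatting differences do not count as disagreement."""
    return value.strip().lower()


def compare_fields(rec_a, rec_b, fields=FIELDS):
    """Return one boolean per field: True where both records agree."""
    return [normalise(rec_a[f]) == normalise(rec_b[f]) for f in fields]


def classify_pair(rec_a, rec_b, fields=FIELDS, threshold=0.5):
    """Classify a pair as 'link' if more than `threshold` of the
    compared fields agree, otherwise as 'non-link'.
    The threshold value is an assumption, not taken from the paper."""
    agreement = compare_fields(rec_a, rec_b, fields)
    share = sum(agreement) / len(fields)
    return "link" if share > threshold else "non-link"


rec_1 = {"given_name": "Peter", "surname": "Miller",
         "date_of_birth": "1970-01-02", "postcode": "2600"}
rec_2 = {"given_name": " peter", "surname": "MILLER",
         "date_of_birth": "1970-01-02", "postcode": "2601"}
rec_3 = {"given_name": "Paul", "surname": "Smith",
         "date_of_birth": "1970-01-02", "postcode": "2600"}

print(classify_pair(rec_1, rec_2))  # three of four fields agree
print(classify_pair(rec_1, rec_3))  # only two of four fields agree
```

In this sketch every field counts equally. The probabilistic approaches discussed in this section instead weight each attribute according to how informative its agreement or disagreement is.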
If two data sets A and B are to be linked, record pairs are classified in a product space A × B into M, the set of true matches, and U, the set of true non-matches. Fellegi & Sunter [11] considered ratios of probabilities of the form

1 See: http://www.python.org
Similar resources
Probabilistic Deduplication, Record Linkage and Geocoding
Outline: Background and illustrative example; Record linkage; Applications, privacy and ethics; Our project and our tools; Data cleaning and standardisation; Probabilistic data standardisation and HMMs; Blocking / indexing; Record pair classification; Geocoding; Outlook. Peter Christen, May 2005
Probabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When the comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets would lose the information available in the other data sets. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognising duplicates in a data set. The i...
Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering
Probabilistic record linkage, the task of merging two or more databases in the absence of a unique identifier, is a perennial and challenging problem. It is closely related to the problem of deduplicating a single database, which can be cast as linking a single database against itself. In both cases the number of possible links grows rapidly in the size of the databases under consideration, and...
A Probabilistic Geocoding System based on a National Address File
It is estimated that between 80% and 90% of governmental and business data collections contain address information. Geocoding – the process of assigning geographic coordinates to addresses – is becoming increasingly important in many application areas that involve the analysis and mining of such data. In many cases, address records are captured and/or stored in a free-form or inconsistent manne...
Publication date: 2005